Conversation
The function used to perform the custom reduction in a single work-item (the leader of the work-group, sequentially). It now does so cooperatively for a few iterations, and processes the remaining non-reduced elements sequentially in the leading work-item. `custom_reduce_over_group` sped up by about a factor of 3. The following now shows the timing of the reduction kernel

```
unitrace -d -v -i 20 python -c "import dpctl.tensor as dpt; dpt.min(dpt.ones(10**7, dtype=dpt.float32)).sycl_queue.wait()"
```

to be on par with (less than 10% slower than) the int32 kernel, which uses the built-in `sycl::reduce_over_group`:

```
unitrace -d -v -i 20 python -c "import dpctl.tensor as dpt; dpt.min(dpt.ones(10**7, dtype=dpt.int32)).sycl_queue.wait()"
```
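A minimal sketch of the two-phase reduction described above, written as plain serial C++ rather than SYCL (the function name, the min operation, and the work-group model are illustrative, not dpctl's actual kernel): the outer loop models the cooperative halving iterations, and the final loop models the leader work-item reducing the leftover elements sequentially.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative only: models a work-group reducing `data` with min.
// Phase 1 halves the active range cooperatively for a few iterations;
// phase 2 lets the leader (work-item 0) finish sequentially.
float custom_reduce_min(std::vector<float> data, std::size_t n_coop_iters)
{
    std::size_t active = data.size();
    // Phase 1: cooperative halving. In the real kernel each inner-loop
    // body runs on a separate work-item, with a barrier per iteration;
    // here the inner loop plays those work-items serially.
    for (std::size_t it = 0; it < n_coop_iters && active > 1; ++it) {
        const std::size_t half = active / 2;
        for (std::size_t i = 0; i < half; ++i) {
            data[i] = std::min(data[i], data[i + (active - half)]);
        }
        active -= half;
    }
    // Phase 2: the leading work-item reduces the non-reduced remainder.
    float res = data[0];
    for (std::size_t i = 1; i < active; ++i) {
        res = std::min(res, data[i]);
    }
    return res;
}
```

Running only a few cooperative iterations (rather than a full log2 tree) is the trade-off the change makes: most of the data is combined in parallel, and the short sequential tail stays cheap.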
Deleted rendered PR docs from intelpython.github.com/dpctl, the latest should be updated shortly. 🤞
Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_324 ran successfully.
Doing so exactly recovers the behavior of `sub_group::load<vec_sz>` and `sub_group::store<vec_sz>`, and eliminates warnings with 2025.1 and SYCLOS. With this change, the use of `group_load`/`group_store` is enabled for DPC++ compilers with `__SYCL_MAJOR_VERSION >= 8u`, which includes the oneAPI DPC++ 2025.0.x compiler and the SYCLOS bundle.
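The version gate described above can be sketched as a compile-time branch. The `__SYCL_MAJOR_VERSION >= 8u` check comes from the description; the stub default and the flag name are illustrative only, not dpctl's actual guard:

```cpp
// __SYCL_MAJOR_VERSION is defined by the DPC++ compiler; stub a value
// here only so the sketch compiles standalone (illustrative, not dpctl).
#ifndef __SYCL_MAJOR_VERSION
#define __SYCL_MAJOR_VERSION 8u
#endif

#if __SYCL_MAJOR_VERSION >= 8u
// oneAPI DPC++ 2025.0.x and SYCLOS: take the group_load/group_store path.
constexpr bool use_group_load_store = true;
#else
// Older compilers: fall back to sub_group::load<vec_sz>/store<vec_sz>.
constexpr bool use_group_load_store = false;
#endif
```

Gating on the compiler's major version rather than on feature-test macros is a pragmatic choice here, since both 2025.0.x and SYCLOS report a major version of at least 8.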
e8c10f1 to 03910f3
Factor out bounds as constexpr values, reused between the power-of-two branch and the non-power-of-two branch. Lowered the lower bounds from 32 to 8 based on performance testing on PVC and Iris Xe.
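A sketch of the shared-bounds idea, with hypothetical names, predicate, and an invented upper bound; only the lowering of the lower bound from 32 to 8 comes from the text above:

```cpp
#include <cstddef>

// Hypothetical names and thresholds (only the 32 -> 8 lowering is from
// the PR description). Factoring the bounds out as constexpr values
// gives both branches a single tuning point.
constexpr std::size_t coop_lower_bound = 8;    // was 32
constexpr std::size_t coop_upper_bound = 1024; // invented for illustration

constexpr bool is_power_of_two(std::size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

// Both the power-of-two and the non-power-of-two branches consult the
// same shared bounds instead of repeating magic numbers.
constexpr bool use_cooperative_path(std::size_t wg_size)
{
    if (is_power_of_two(wg_size)) {
        return wg_size >= coop_lower_bound && wg_size <= coop_upper_bound;
    }
    return wg_size >= coop_lower_bound; // non-power-of-two branch
}
```

With one definition of each bound, a retuning like the 32 to 8 change touches a single constant rather than two branch bodies.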
…_over_group function
Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_325 ran successfully.
Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_326 ran successfully.
ndgrigorian
left a comment
This brings a great performance improvement for reductions not using sycl::reduce_over_group.
This LGTM, thank you @oleksandr-pavlyk !
Also fixed the uses of group_load/group_store and enabled them for the 2025.0 compiler. SYCLOS warnings should be gone now.